Word Embeddings and Sentiment Analysis¶

University Assignment: Explore the use of word embeddings and sentiment analysis techniques in natural language processing (NLP). Use a dataset of movie reviews to create a word embedding model using Word2Vec and evaluate the performance of the model in sentiment analysis tasks.

Dataset Description¶

Dataset: IMDb Movie Reviews. The dataset consists of movie reviews from the IMDb website, along with their corresponding sentiment labels (positive or negative). The original dataset is split into a training set and a test set of 25,000 reviews each; the CSV used here contains all 50,000 reviews in a single file, so the notebook performs its own train/test split below.

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
#read dataset
import pandas as pd
df = pd.read_csv('IMDB Dataset.csv')
df.head()
Out[2]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [3]:
df.shape 
Out[3]:
(50000, 2)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
In [5]:
print(
    "This dataset has two variables: review and sentiment.\n"
    "review is the text of a movie review posted on IMDb.\n"
    "sentiment is the positive or negative sentiment that the review expresses."
)
This dataset has two variables: review and sentiment.
review is the text of a movie review posted on IMDb.
sentiment is the positive or negative sentiment that the review expresses.
In [6]:
#!pip install nltk
In [7]:
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
[nltk_data] Downloading package stopwords to /Users/kiko/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kiko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
In [8]:
# Split the dataframe into training and testing sets
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25)
train_df = X_train.to_frame()
test_df = X_test.to_frame()
In [9]:
train_df.head()
Out[9]:
review
11681 Oh just what I needed,another movie about 19th...
24009 I saw this only because my 10-yr-old was bored...
40502 The show itself basically reflects the typical...
755 A well-made run-of-the-mill movie with a tragi...
26143 I just bought this movie yesterday night, and ...
In [10]:
test_df.head()
Out[10]:
review
39550 I went into this film expecting it to be simil...
11244 Funny that I find myself forced to review this...
40728 This film is really really bad, it is not very...
40580 Now, I haven't read the original short story t...
46371 I suppose I should be fair and point out that ...
In [11]:
def clean_text(df):

    # lowercase and strip leading/trailing whitespace
    df['clean_col'] = df['review'].apply(lambda x: x.lower().strip())

    # collapse runs of spaces
    df['clean_col'] = df['clean_col'].apply(lambda x: re.sub(' +', ' ', x))

    # replace everything except letters with spaces; this removes punctuation
    # and digits in one pass (note: '<br />' tags leave behind a 'br' token)
    df['clean_col'] = df['clean_col'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

    # remove stopwords
    df['clean_col'] = df['clean_col'].apply(
        lambda x: ' '.join(word for word in x.split() if word not in stop_words))

    # tokenize
    df['clean_col'] = df['clean_col'].apply(word_tokenize)

    return df
In [12]:
#clean train and test dataset
clean_text(train_df)
Out[12]:
review clean_col
11681 Oh just what I needed,another movie about 19th... [oh, needed, another, movie, th, century, engl...
24009 I saw this only because my 10-yr-old was bored... [saw, yr, old, bored, friend, hated, course, l...
40502 The show itself basically reflects the typical... [show, basically, reflects, typical, nature, a...
755 A well-made run-of-the-mill movie with a tragi... [well, made, run, mill, movie, tragic, ending,...
26143 I just bought this movie yesterday night, and ... [bought, movie, yesterday, night, love, everyo...
... ... ...
31240 (Some spoilers included:)<br /><br />Although,... [spoilers, included, br, br, although, many, c...
40664 This movie had very few moments of real drama.... [movie, moments, real, drama, opening, minutes...
39078 The third film in a cycle of incomparably bril... [third, film, cycle, incomparably, brilliant, ...
49881 Definitely an odd debut for Michael Madsen. Ma... [definitely, odd, debut, michael, madsen, mads...
8261 Actually my vote is a 7.5. Anyway, the movie w... [actually, vote, anyway, movie, good, funny, p...

37500 rows × 2 columns

In [13]:
clean_text(test_df)
Out[13]:
review clean_col
39550 I went into this film expecting it to be simil... [went, film, expecting, similar, matrix, pi, b...
11244 Funny that I find myself forced to review this... [funny, find, forced, review, movie, br, br, r...
40728 This film is really really bad, it is not very... [film, really, really, bad, well, done, lack, ...
40580 Now, I haven't read the original short story t... [read, original, short, story, know, literary,...
46371 I suppose I should be fair and point out that ... [suppose, fair, point, believe, ghosts, said, ...
... ... ...
14015 This is one of the best Fred Astaire-Ginger Ro... [one, best, fred, astaire, ginger, rogers, fil...
15507 Police Story is one of Jackie Chan's classic f... [police, story, one, jackie, chan, classic, fi...
31089 This is not a good movie but I still like it. ... [good, movie, still, like, cat, clovis, gold, ...
26840 Red Rock West is one of those rare films that ... [red, rock, west, one, rare, films, keeps, gue...
14029 After seeing this routine by John Leguizamo, I... [seeing, routine, john, leguizamo, finally, re...

12500 rows × 2 columns

Data Preprocessing and Exploration¶

Preprocess the data and explore its characteristics.

Data Cleaning

Load the training set and test set into two separate dataframes. Clean the text data by removing punctuation, digits, and stop words. Tokenize the text data and convert it to lowercase.
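The cleaning steps above can be sketched on a single string. This is a minimal illustration only: the tiny stop-word set here is a hypothetical stand-in for NLTK's full English list used in the notebook.

```python
import re

# illustrative stop-word set; the notebook uses NLTK's full English list
STOP_WORDS = {"i", "this", "was", "a", "the", "in"}

def clean_review(text):
    """Lowercase, strip non-letters, drop stop words, and tokenize."""
    text = text.lower().strip()
    text = re.sub(r"[^a-z]", " ", text)  # punctuation and digits become spaces
    return [w for w in text.split() if w not in STOP_WORDS]

print(clean_review("I thought this was GREAT in 1994!"))  # → ['thought', 'great']
```

Because lowercasing happens first, the letters-only regex can safely match `[^a-z]` instead of `[^a-zA-Z]`.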

Data Exploration

Compute and plot the distribution of the length of the reviews (in terms of the number of words). Compute the frequency of the top-k most common words in the training set.
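The top-k count described above reduces to a single `Counter.most_common` call once the token lists are flattened; a minimal sketch on a toy corpus:

```python
from collections import Counter

# toy tokenized corpus standing in for the cleaned training reviews
docs = [["good", "movie"], ["bad", "movie"], ["movie", "fun"]]

# flatten the token lists and count word frequencies
counts = Counter(word for doc in docs for word in doc)
print(counts.most_common(2))  # → [('movie', 3), ('good', 1)]
```

Ties in `most_common` are broken by first-encountered order, which is why `'good'` precedes `'bad'` here.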

In [14]:
# Code snippet for data exploration
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

def count_words(df):
    # count the number of tokens in each cleaned review
    df['review_length'] = df['clean_col'].apply(len)


# Distribution of review lengths
# This plot should have the length of reviews along the x axis and the
# frequency of that length along the y axis.

# plot for train dataset
count_words(train_df)
# sns.distplot is deprecated; histplot is its replacement
sns.histplot(train_df['review_length'], bins=50)
Out[14]:
<AxesSubplot:xlabel='review_length'>
In [15]:
# plot review length distribution for test dataset
count_words(test_df)
sns.histplot(test_df['review_length'], bins=50)
Out[15]:
<AxesSubplot:xlabel='review_length'>
In [16]:
# Frequency of top-k most common words 
#I am using top-150 words

from collections import Counter
lst = train_df['clean_col'].explode().to_list()
Counter = Counter(lst).most_common(150)
Counter[:10]
Out[16]:
[('br', 151788),
 ('movie', 66139),
 ('film', 59605),
 ('one', 40288),
 ('like', 30140),
 ('good', 22465),
 ('time', 18861),
 ('even', 18592),
 ('would', 18433),
 ('story', 17442)]
In [17]:
#create a list of the top 150 words
top_150_word = []
i = 0
while i < len(Counter):
    top_150_word.append(Counter[i][0])
    i += 1

Word Embedding¶

Create a word embedding model using Word2Vec.

Word2Vec Training

Train a Word2Vec model on the cleaned text data in the training set. Save the trained model to a file for later use.

Word Embedding Visualization

Visualize the embeddings of the top-k most common words in the training set using t-SNE. Visualize the embeddings of the words "good", "bad", "great", and "terrible" using t-SNE.
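t-SNE projects the high-dimensional word vectors down to 2-D points for plotting. A minimal sketch with random stand-in vectors (the key constraint, relevant to the four-word plot below, is that perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 100))  # stand-in for 10 word vectors

# perplexity must be < n_samples (10 here)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vecs)
print(coords.shape)  # → (10, 2)
```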

In [18]:
# Use Word2Vec to create and save your word2vec model
import gensim
from gensim.models import Word2Vec

text = train_df.clean_col.tolist()

# Create Word2Vec model for the training set
# (vector_size=100 is the gensim default; the sentence-vector code below
# relies on this dimensionality)
model = Word2Vec(sentences=text, vector_size=100)

# Save the trained model to a file for later use
model.save('imdb_word2vec.model')
In [19]:
# Code snippet for embedding visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np


# Visualize embeddings of top-150 words. Create a 2d embedding array using TSNE and use that for your visualisation

num_components = 2  
model_vector_lst = []
for i in top_150_word:
    model_vector_lst.append(model.wv[i])
    
vectors = np.asarray(model_vector_lst)
labels = np.asarray(top_150_word)  

# apply TSNE 
tsne = TSNE(n_components=num_components, random_state=0)
vectors = tsne.fit_transform(vectors)

x_vals = [v[0] for v in vectors]
y_vals = [v[1] for v in vectors]   


def plot_embeddings(x_vals, y_vals, labels):
    import plotly.graph_objs as go
    fig = go.Figure()
    trace = go.Scatter(x=x_vals, y=y_vals, mode='markers', text=labels)
    fig.add_trace(trace)
    fig.update_layout(title="Word2Vec embeddings visualized with t-SNE")
    fig.show()
    return fig

plot = plot_embeddings(x_vals, y_vals, labels)

#This is an interactive plot. If you hover on the points, you can see the word/label
In [20]:
# Visualize embeddings of specific words. Create a 2d embedding array using TSNE and use that for your visualisation

words_to_visualize = ["good", "bad", "great", "terrible"]
In [21]:
num_components = 2  
model_vector_lst = []
for i in words_to_visualize:
    model_vector_lst.append(model.wv[i])
In [22]:
vectors = np.asarray(model_vector_lst)
labels = np.asarray(words_to_visualize)  

# apply TSNE (perplexity must be smaller than the number of samples, 4 here)
tsne = TSNE(n_components=num_components, random_state=0, perplexity=1)
vectors = tsne.fit_transform(vectors)

x_vals = [v[0] for v in vectors]
y_vals = [v[1] for v in vectors]
In [23]:
plot = plot_embeddings(x_vals, y_vals, labels)

#This is an interactive plot. If you hover on the points, you can see the word/label.

Sentiment Analysis and Evaluation¶

Use the word embeddings to perform sentiment analysis on the test set and evaluate the performance of the model.

Sentiment Analysis

Convert the cleaned text data in the test set to vectors using the trained Word2Vec model. Train a logistic regression model on the vector representations of the text data in the training set. Use the trained logistic regression model to predict the sentiment labels (positive or negative) of the text data in the test set.
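The pipeline described above (review → summed word vectors → logistic regression) can be sketched end to end on toy data. The two-dimensional embedding table and its values are entirely hypothetical, chosen only to make the example separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical 2-D embedding table, for illustration only
emb = {"good": np.array([1.0, 0.0]),
       "bad": np.array([-1.0, 0.0]),
       "movie": np.array([0.0, 1.0])}

def review_vector(tokens):
    # sum the vectors of in-vocabulary words, as the notebook's loop does
    vecs = [emb[t] for t in tokens if t in emb]
    return np.sum(vecs, axis=0) if vecs else np.zeros(2)

X = np.array([review_vector(r) for r in
              [["good", "movie"], ["bad", "movie"], ["good"], ["bad"]]])
y = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

clf = LogisticRegression().fit(X, y)
print(clf.predict([review_vector(["good", "good", "movie"])]))  # → [1]
```

Out-of-vocabulary words simply contribute nothing to the sum, which mirrors the `key_to_index` membership check in the notebook.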

Evaluation

Calculate the accuracy, precision, recall, and F1 score of the sentiment analysis model. Visualize the confusion matrix of the sentiment analysis model.
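The evaluation metrics above come straight from `sklearn.metrics`; a minimal sketch on toy labels, which also shows the `(y_true, y_pred)` argument order that `confusion_matrix` expects:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

acc = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: true, cols: predicted

print(round(acc, 2))  # → 0.6
print(cm)             # → [[1 1]
                      #    [1 2]]
```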

In [24]:
# To make the analysis faster, I am going to convert the whole dataset into
# vectors first and then perform the train/test split. I will use the same
# random state as before (random_state=104), so the split will be identical.
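The claim that re-running `train_test_split` with the same `random_state` reproduces the split exactly can be checked directly on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)
a, b = train_test_split(X, random_state=104, test_size=0.25)
c, d = train_test_split(X, random_state=104, test_size=0.25)

print((a == c).all() and (b == d).all())  # → True
```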
In [25]:
df.head()
Out[25]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [26]:
clean_text(df)
Out[26]:
review sentiment clean_col
0 One of the other reviewers has mentioned that ... positive [one, reviewers, mentioned, watching, oz, epis...
1 A wonderful little production. <br /><br />The... positive [wonderful, little, production, br, br, filmin...
2 I thought this was a wonderful way to spend ti... positive [thought, wonderful, way, spend, time, hot, su...
3 Basically there's a family where a little boy ... negative [basically, family, little, boy, jake, thinks,...
4 Petter Mattei's "Love in the Time of Money" is... positive [petter, mattei, love, time, money, visually, ...
... ... ... ...
49995 I thought this movie did a down right good job... positive [thought, movie, right, good, job, creative, o...
49996 Bad plot, bad dialogue, bad acting, idiotic di... negative [bad, plot, bad, dialogue, bad, acting, idioti...
49997 I am a Catholic taught in parochial elementary... negative [catholic, taught, parochial, elementary, scho...
49998 I'm going to have to disagree with the previou... negative [going, disagree, previous, comment, side, mal...
49999 No one expects the Star Trek movies to be high... negative [one, expects, star, trek, movies, high, art, ...

50000 rows × 3 columns

In [27]:
# join the token lists back into whitespace-separated strings
df['clean_col2'] = [' '.join(tokens) for tokens in df['clean_col']]
In [28]:
df.head()
Out[28]:
review sentiment clean_col clean_col2
0 One of the other reviewers has mentioned that ... positive [one, reviewers, mentioned, watching, oz, epis... one reviewers mentioned watching oz episode ho...
1 A wonderful little production. <br /><br />The... positive [wonderful, little, production, br, br, filmin... wonderful little production br br filming tech...
2 I thought this was a wonderful way to spend ti... positive [thought, wonderful, way, spend, time, hot, su... thought wonderful way spend time hot summer we...
3 Basically there's a family where a little boy ... negative [basically, family, little, boy, jake, thinks,... basically family little boy jake thinks zombie...
4 Petter Mattei's "Love in the Time of Money" is... positive [petter, mattei, love, time, money, visually, ... petter mattei love time money visually stunnin...
In [29]:
# Count vectorization of text
from sklearn.feature_extraction.text import CountVectorizer
In [30]:
corpus = df['clean_col2'].values
In [31]:
# Creating the vectorizer (stop words were already removed during cleaning,
# so stop_words='english' only catches any sklearn-specific leftovers)
vectorizer = CountVectorizer(stop_words='english')
In [32]:
# Converting the text to numeric data
X = vectorizer.fit_transform(corpus)
In [33]:
# Preparing DataFrame for machine learning; the sentiment column will serve
# as the target variable later.
# NOTE: densifying a 50,000 x ~99,000 document-term matrix needs a very large
# amount of memory; keeping X sparse would be preferable at scale.
CountVectorizedData = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
In [34]:
# Creating the list of words which are present in the Document term matrix
WordsVocab=CountVectorizedData.columns
 
# Printing sample words
WordsVocab[0:10]
Out[34]:
Index(['aa', 'aaa', 'aaaaaaaaaaaahhhhhhhhhhhhhh', 'aaaaaaaargh', 'aaaaaaah',
       'aaaaaaahhhhhhggg', 'aaaaagh', 'aaaaah', 'aaaaahhhh', 'aaaaargh'],
      dtype='object')
In [35]:
len(WordsVocab)
Out[35]:
99056
In [36]:
# The document-term matrix for the full corpus was already computed above,
# so re-use it instead of running the vectorizer a second time
In [37]:
CountVecData = CountVectorizedData
In [38]:
W2Vec_Data = pd.DataFrame()
In [39]:
# Build a 100-dimensional vector for each review by summing the Word2Vec
# vectors of the words it contains

rows = []
for i in range(CountVecData.shape[0]):
    Sentence = np.zeros(100)

    # Loop through each word in the review; if it is present in the
    # Word2Vec vocabulary, add its vector to the running sum
    for word in WordsVocab[CountVecData.iloc[i, :] >= 1]:
        if word in model.wv.key_to_index:
            Sentence = Sentence + model.wv[word]
    rows.append(Sentence)

# DataFrame.append was removed in pandas 2.0; collecting the rows in a list
# and building the frame once is also much faster
W2Vec_Data = pd.DataFrame(rows)
In [40]:
W2Vec_Data
Out[40]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 0.845066 21.074142 -27.039258 9.002568 -18.449693 -19.489673 24.202210 13.770300 -63.909698 -30.462318 ... 43.626524 50.205337 -11.977495 -2.651355 63.870902 33.575529 -6.873737 -4.653884 33.407272 -3.028877
0 4.646513 -27.095638 -11.011114 -15.893251 -7.019846 -9.543718 17.137718 32.488901 -29.658915 -11.607785 ... 18.276616 23.801546 11.551250 22.952833 43.372218 34.946268 3.623533 -17.111816 29.811776 -2.583454
0 27.581128 -4.100506 -15.049071 1.995729 -18.761900 -7.315710 37.401362 15.316645 -20.697380 -5.300695 ... 24.438132 35.133365 -9.264058 15.200469 46.843492 23.440720 1.250978 -10.280586 19.974900 13.492671
0 4.278052 5.576408 4.444432 -4.695909 -16.734504 -0.619649 22.183735 6.021529 -20.397884 -14.555054 ... 26.877040 18.771703 -4.897337 6.664134 30.610391 19.250141 4.208647 -3.795059 12.350482 15.395156
0 2.725437 -20.192761 -15.919149 0.168059 -8.541662 -26.854808 26.608201 8.664394 -18.287214 6.231360 ... 27.186128 43.786424 5.449669 19.755477 72.035188 26.304189 5.238519 1.304170 6.507701 21.170385
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0 10.808050 -14.689696 -9.491751 12.295364 -9.635763 0.565626 43.320826 -2.752151 -22.291317 -10.542328 ... -3.435190 27.442312 -15.341893 35.154118 58.849406 30.870952 11.762682 -21.430055 14.595973 22.106617
0 17.146678 3.327781 0.254207 -4.129948 -11.971751 -5.713457 27.006311 15.177977 -8.910854 -24.821422 ... 24.132641 15.240886 -12.247834 10.002461 30.757395 18.018837 6.508978 -10.108581 9.777080 5.976269
0 2.415512 -0.835904 -2.650888 -4.701153 -6.430142 -20.942541 15.420227 -1.658896 -23.537929 -1.566711 ... 44.268204 19.043730 11.429349 11.499594 39.452780 17.482174 -28.298990 2.480149 10.672934 8.731389
0 2.747474 6.171118 1.112332 -0.504889 0.562939 -14.652804 20.821587 35.256244 -20.524770 -6.837540 ... 35.610543 21.499424 -2.286395 6.553049 51.991320 19.468075 11.514976 -11.060579 12.789450 5.820660
0 24.450603 -12.436791 -2.608275 10.086381 -17.740947 -8.711633 32.252910 9.720001 -17.300261 -23.208122 ... -1.322157 21.698793 -11.248874 -0.543811 42.189506 13.664176 9.015829 -11.308207 4.205468 4.607011

50000 rows × 100 columns

In [41]:
W2Vec_Data.shape
Out[41]:
(50000, 100)
In [42]:
#ML on training set
In [43]:
# Adding the target variable
W2Vec_Data.reset_index(inplace=True, drop=True)
In [44]:
W2Vec_Data['sentiment']=df['sentiment']
In [45]:
# Assigning to DataForML variable
DataForML=W2Vec_Data
DataForML.head()
Out[45]:
0 1 2 3 4 5 6 7 8 9 ... 91 92 93 94 95 96 97 98 99 sentiment
0 0.845066 21.074142 -27.039258 9.002568 -18.449693 -19.489673 24.202210 13.770300 -63.909698 -30.462318 ... 50.205337 -11.977495 -2.651355 63.870902 33.575529 -6.873737 -4.653884 33.407272 -3.028877 positive
1 4.646513 -27.095638 -11.011114 -15.893251 -7.019846 -9.543718 17.137718 32.488901 -29.658915 -11.607785 ... 23.801546 11.551250 22.952833 43.372218 34.946268 3.623533 -17.111816 29.811776 -2.583454 positive
2 27.581128 -4.100506 -15.049071 1.995729 -18.761900 -7.315710 37.401362 15.316645 -20.697380 -5.300695 ... 35.133365 -9.264058 15.200469 46.843492 23.440720 1.250978 -10.280586 19.974900 13.492671 positive
3 4.278052 5.576408 4.444432 -4.695909 -16.734504 -0.619649 22.183735 6.021529 -20.397884 -14.555054 ... 18.771703 -4.897337 6.664134 30.610391 19.250141 4.208647 -3.795059 12.350482 15.395156 negative
4 2.725437 -20.192761 -15.919149 0.168059 -8.541662 -26.854808 26.608201 8.664394 -18.287214 6.231360 ... 43.786424 5.449669 19.755477 72.035188 26.304189 5.238519 1.304170 6.507701 21.170385 positive

5 rows × 101 columns

In [46]:
# Changing the positive to 1, and negative to 0
DataForML['sentiment'] = DataForML['sentiment'].map({'negative':0,'positive':1})
In [47]:
DataForML.head()
Out[47]:
0 1 2 3 4 5 6 7 8 9 ... 91 92 93 94 95 96 97 98 99 sentiment
0 0.845066 21.074142 -27.039258 9.002568 -18.449693 -19.489673 24.202210 13.770300 -63.909698 -30.462318 ... 50.205337 -11.977495 -2.651355 63.870902 33.575529 -6.873737 -4.653884 33.407272 -3.028877 1
1 4.646513 -27.095638 -11.011114 -15.893251 -7.019846 -9.543718 17.137718 32.488901 -29.658915 -11.607785 ... 23.801546 11.551250 22.952833 43.372218 34.946268 3.623533 -17.111816 29.811776 -2.583454 1
2 27.581128 -4.100506 -15.049071 1.995729 -18.761900 -7.315710 37.401362 15.316645 -20.697380 -5.300695 ... 35.133365 -9.264058 15.200469 46.843492 23.440720 1.250978 -10.280586 19.974900 13.492671 1
3 4.278052 5.576408 4.444432 -4.695909 -16.734504 -0.619649 22.183735 6.021529 -20.397884 -14.555054 ... 18.771703 -4.897337 6.664134 30.610391 19.250141 4.208647 -3.795059 12.350482 15.395156 0
4 2.725437 -20.192761 -15.919149 0.168059 -8.541662 -26.854808 26.608201 8.664394 -18.287214 6.231360 ... 43.786424 5.449669 19.755477 72.035188 26.304189 5.238519 1.304170 6.507701 21.170385 1

5 rows × 101 columns

In [48]:
# Split the dataframe into training and testing sets using the same random state as before
X = DataForML.drop('sentiment', axis=1)
y = DataForML['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25)
In [49]:
X_train.head() #we have the same split as before
Out[49]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
11681 4.570087 11.356616 1.378616 11.755995 -7.752678 -6.863883 26.874491 16.368971 -4.599308 8.510776 ... 17.531065 29.965890 0.394637 14.854334 36.792583 14.869995 14.975973 1.061858 5.724960 15.369798
24009 21.086119 -10.296923 -3.712598 1.085078 -11.816567 -9.119838 40.764881 -0.447825 -17.496902 -14.338426 ... 18.316860 15.392579 -0.441269 7.672108 34.320870 30.942831 7.530459 -8.615611 4.526077 6.234121
40502 -4.762835 -7.835522 -3.633352 -0.749317 -2.904131 -3.663638 15.119360 5.690972 -17.137478 -11.269718 ... 27.155459 29.246034 -0.548935 5.717524 36.694034 15.558508 2.011929 2.138965 4.805004 0.575274
755 0.975955 -8.224445 -4.920710 -3.016067 -15.312504 -4.898792 7.329217 6.384194 -26.553284 -13.656131 ... 16.070822 19.251972 2.319530 -1.982912 25.151508 19.765591 -7.650933 -0.136045 13.125170 17.002416
26143 2.421924 -15.065130 -11.060019 4.496135 -17.723280 8.446882 16.907165 10.839111 -16.144197 -3.672041 ... 26.475772 17.296033 -4.513053 18.116871 39.706810 36.393804 0.474464 -4.611091 11.770387 7.825103

5 rows × 100 columns

In [50]:
X_test.head() #we have the same split as before
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
39550 12.459414 3.956161 2.986157 -11.152775 -21.326900 -11.577348 36.277918 5.847686 -37.068659 -29.201929 ... 30.586074 45.044685 -8.056366 19.965253 73.038487 33.088841 2.399853 -13.107485 38.969429 15.764981
11244 45.670047 -4.157246 -26.522301 -13.792800 8.029035 -41.404722 62.447519 61.581277 -57.570185 -38.053936 ... 74.407756 88.492903 5.269985 18.665334 166.756333 58.480183 33.906816 -48.480392 67.569411 16.121584
40728 9.379913 -11.177137 -0.704772 0.174617 -9.732054 -14.047008 20.401321 0.822430 -17.634648 -26.720074 ... 15.817401 22.173254 13.011449 6.789313 39.143032 22.923904 3.547468 -1.901817 12.155588 5.649701
40580 22.969059 7.293580 2.123036 5.554118 -36.678775 -18.622362 49.615374 -3.282337 -33.056823 -25.475983 ... 28.810939 41.458976 7.458327 26.327205 67.371688 37.125004 1.675617 -17.003198 26.606550 18.196014
46371 14.967359 -6.518351 8.341882 -7.194701 -12.743630 -34.869573 52.300401 1.341762 -42.710234 -19.837030 ... 35.679409 43.943890 1.770937 22.140901 67.812764 23.423750 -4.646741 -1.587722 12.092082 13.814334

5 rows × 100 columns

In [51]:
# Code snippet for sentiment analysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
In [52]:
clf = LogisticRegression(C=10, penalty='l2', solver='newton-cg')
In [53]:
LOG=clf.fit(X_train,y_train)
In [54]:
# Generating predictions on testing data
prediction=LOG.predict(X_test)
In [55]:
# Measuring accuracy on Testing Data

# Shows precision, recall, F1 score, support, accuracy and lastly, confusion matrix

from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
# confusion_matrix expects (y_true, y_pred): rows are true labels,
# columns are predicted labels
print(metrics.confusion_matrix(y_test, prediction))
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      6233
           1       0.86      0.86      0.86      6267

    accuracy                           0.86     12500
   macro avg       0.86      0.86      0.86     12500
weighted avg       0.86      0.86      0.86     12500

[[5340  893]
 [ 861 5406]]
In [56]:
# Printing the overall weighted F1 score of the model
F1_Score = metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model on the test set:', round(F1_Score, 2))
Weighted F1 score of the model on the test set: 0.86